Abstract

This paper provides a short introduction to the group testing problem, 
and reviews various aspects of its statistical physics formulation. Two 
main issues are discussed: the optimal design of pools used in a two- 
stage testing experiment, like the one often used in medical or biological 
applications, and the inference problem of detecting defective items based 
on pool diagnosis. The paper is largely based on M. Mezard and C.
Toninelli, arXiv:0706.3104, and M. Mezard and M. Tarzia, Phys. Rev. E
76, 041124 (2007).

1 Introduction 

Group testing dates back to 1943, when Dorfman suggested using it to test
whether US draftees had syphilis [1]. Instead of testing each individual blood
sample, the idea is to mix the blood and test pools. In a group of N soldiers,
one can first test N/k pools of k individuals. Then one focuses on the infected
pools and performs a second stage of tests on all the individuals belonging to
these infected pools. Assuming that each soldier is infected with probability p
(and that the infections are uncorrelated), the total expected number of tests is

$$ E\,T = \frac{N}{k} + \left[1-(1-p)^k\right]\frac{N}{k}\,k . \tag{1} $$

Minimizing this expression over k (for small p one has $1-(1-p)^k \simeq pk$), one
finds that the optimal size of the pools is $k \simeq 1/\sqrt{p}$, giving an expected
number of tests $E\,T \simeq 2\sqrt{p}\,N$. If the prevalence of infection is small,
$p \ll 1$, Dorfman's proposal reduces the total number of blood tests, compared to
individual testing, by a factor $2\sqrt{p}$.
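As a small numerical illustration of eq. (1), the following sketch evaluates the expected number of tests and finds the best integer pool size by brute force. The function names and the search range `k_max` are illustrative assumptions, and the fact that N/k must be an integer is ignored.

```python
import math


def dorfman_expected_tests(n_items: int, p: float, k: int) -> float:
    """Eq. (1): N/k first-stage pools, plus k individual retests for each
    pool that contains at least one infected sample."""
    return n_items / k + n_items * (1.0 - (1.0 - p) ** k)


def best_pool_size(n_items: int, p: float, k_max: int = 1000) -> int:
    """Brute-force minimisation of eq. (1) over integer pool sizes 1..k_max."""
    return min(range(1, k_max + 1),
               key=lambda k: dorfman_expected_tests(n_items, p, k))


def small_p_estimate(n_items: int, p: float) -> float:
    """Small-p approximation 2 sqrt(p) N quoted in the text."""
    return 2.0 * math.sqrt(p) * n_items
```

For p = 0.01 the brute-force optimum lies next to the estimate $1/\sqrt{p} = 10$, and the corresponding number of tests stays slightly below $2\sqrt{p}\,N$.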

The general problem of group testing [5] is that of identifying defectives
in a set of items by a series of tests on pools of items, where each test only
detects whether the pool contains at least one defective item. It has
numerous applications. In particular it is being used in building physical maps
of the genome, by detecting whether a special target subsequence of bases is
present in a DNA strand [3, 4]. It has also been suggested for HIV
detection [5], for detecting failures in distributed computation [6], and for data
gathering in sensor networks [7].

We shall subdivide the group testing problem into two main topics: the pool
design (building optimal pools, exploiting the possibility to have overlapping 
tests), and the inference problem (how to detect defective items, given the results 
of the pool tests). Two other important classification patterns are the number of 
stages of detection, and whether one needs to be sure of the result (assuming that 
the tests are perfect). For instance Dorfman's design is a two-stage algorithm 
with sure result. In a first stage one tests the N/k pools, and based on the 
results of this first stage one designs the second stage of tests, namely the list of 
all individuals whose blood sample belonged to a defective pool in the first stage. 
Getting sure results (in contrast to results that hold with high probability) is 
often needed in medical applications. In order to obtain them, one ends a group 
testing procedure by a final stage (the second stage in Dorfman's procedure) 
which tests individually all the items for which the procedure did not give a 
sure answer. 

In all our study we suppose that the statuses ('defective' or 'OK') of the
N items under study are iid random variables: each item is defective with
probability p, or OK with probability 1 - p. Furthermore we assume that the
value of p is known. This framework is called 'probabilistic group testing' in
the literature. Another widely studied framework is that of combinatorial
group testing, where the number of defective items is supposed to be known.
Reviews on these two frameworks can be found in [2, 8].

Clearly one can detect defectives more efficiently (with fewer tests) when more
stages are allowed. How efficient can one be? Information theory provides an
easy lower bound, which applies to the situation where there is no limit on the
allowed number of stages. Assume you have a total of N items, and you perform
T tests altogether. There are $2^T$ possible outcomes. If there are d defectives,
a necessary condition to detect them is that $2^T \ge \binom{N}{d}$. In the large N limit,
taking d = Np, one finds that the number of tests T must satisfy

$$ T \ge N H_2(p) = N\left(-p\log_2 p-(1-p)\log_2(1-p)\right) . \tag{2} $$

It is not difficult to design a sequence of pools, with an unbounded number of
stages, which basically reaches this limit.

Clearly, when p is small, the minimal number of tests with an unbounded number
of stages, $N(-p\log_2 p)$, is much smaller than Dorfman's two-stage result
$2N\sqrt{p}$. A natural question is that of the minimal number of tests, and the
corresponding best pool design, if the number of stages is limited to a value
S. In the next section we give the answer when S = 2, for the case of sure results
in two-stage procedures. Remarkably, there exist pool designs which require
only $N H_2(p)/\log 2$ tests, a factor $1/\log 2 \simeq 1.44$ larger than the optimal
unbounded-stage result.

2 Optimal two-stage design in the small p limit 

In their nice analysis of two-stage group testing, Berger and Levenshtein
[10] suggest studying the case where the number of items N goes to infinity,
and the probability of being defective p goes to zero like $p = N^{-\beta}$. In this
limit they obtain lower and upper bounds for the minimal expected number of
tests T(N,p) required to find all the defectives in two stages when $\beta < 1$; the
lower bound reads $\lim_{N\to\infty} T(N,p)/(Np|\log p|) \ge 1/\log 2$.

The recent work of [11] has derived the exact asymptotic minimal expected
number of tests, and the pool design which reaches it, when $\beta < 1/2$, as
well as in the limit $p \to 0$ after $N \to \infty$. In order to state the results in a
compact form, it is convenient to denote by $\lim_{N\to\infty|\beta}$ the
limit where N goes to $\infty$ and p goes to zero, with $p = N^{-\beta}$ and $\beta > 0$. The limit
$\lim_{p\to 0}\lim_{N\to\infty}$ will be referred to as the $\beta = 0$ case.

The two main results of [11] are the following.

1) If $0 \le \beta < 1/2$:

$$ \lim_{N\to\infty|\beta} \frac{T(N,p)}{Np|\log p|} = \frac{1}{(\log 2)^2} . \tag{3} $$

2) In these limits, "regular-regular" pools of girth at least 6, with tests of degree
$K = \log 2/p$ and variables of degree $L = |\log p|/\log 2$, become optimal with
probability tending to one when $N \to \infty|\beta$.

This pooling design is best understood in terms of its factor graph representation.
One builds a graph where there are two types of vertices: each variable
is a vertex (represented by a circle in figure 1), and each test is also a vertex
(represented by a square). An edge is present in the graph, between variable
i and test a, whenever the variable i appears in test a. The graph is thus bipartite,
with edges only between variables and tests. The regular-regular pools
correspond to random factor graphs, uniformly drawn from the set of graphs
where every variable has degree L and every test has degree K, and such that
the girth (the size of the shortest loop) is at least 6. The existence of such
graphs has been demonstrated by Lu and Moura [12]. Their construction is a
bit complicated, but a simpler class of graphs has also been shown in [11] to
reach optimal performance in the asymptotic limit $N \to \infty|\beta$, with $0 \le \beta < 1/2$.
These are the so-called "Regular-Poisson" graphs, where each variable chooses
randomly the $L = |\log p|/\log 2$ tests to which it belongs, uniformly among all
the $\binom{M}{L}$ possible sets of L tests, M being the number of tests. In such a case the
degrees of the tests become asymptotically Poisson distributed, with mean NL/M.

The proof goes in two steps: a general lower bound of combinatorial
nature, and a detailed analysis of the previous two random pool designs,
showing that their number of tests asymptotically matches the lower bound.



2.1 Lower bound 

The lower bound $E\,T/(Np|\log p|) \ge 1/(\log 2)^2$ is obtained as follows. Any pool
design is characterized by a graph, and therefore by a connectivity matrix c
with element $c_{ia} = 1$ if item i belongs to pool a, $c_{ia} = 0$ otherwise. Given a
graph and an item i with $x_i = 0$, let us find the condition for this to be
an undetected 0. This situation occurs whenever every test containing i contains
at least one other item j which is defective (see figure 1). If the girth of the graph
is at least 6, the values of the variables $x_j$ in these neighbouring checks are
uncorrelated, and the expected number of undetected 0s in the first stage, for a
given graph, is

$$ U_0 = (1-p)\sum_{i=1}^{N}\prod_{a=1}^{M}\left[1 - c_{ia}\,(1-p)^{\sum_{j\neq i} c_{ja}}\right] . \tag{4} $$

In general graphs (without any girth condition), one can use a Fortuin-Kasteleyn-
Ginibre inequality [13] to show that $U_0$ is always a lower bound to the expected
number of undetected 0s in the first stage. Then one minimizes the $U_0$ of eq. (4)
over all graphs. This is done using the function f(m), which is the fraction of
sites such that, among their neighbouring checks, $m_1$ have degree 1, $m_2$ have
degree 2, etc. Both the total number of checks and $U_0$ can be written as linear
expressions in f(m). Minimization over f is thus easily done.

Figure 1: The site i is an undetected 0 whenever all the tests containing i
contain at least one item j which is defective (coloured here).



2.2 Upper bound 

It turns out that in the small p limit, the undetected 0s are the dominant source
of errors. One way to see this is through the study of random graph ensembles.

Imagine that the factor graph is generated from an ensemble of graphs where
the degrees of the variables and checks are random variables drawn
from some fixed distribution. We shall adopt the usual notations from the
coding community for describing these degree sequences:

• An item has degree $\ell$ with probability $\Lambda_\ell$. The sequence of $\Lambda_\ell$ is encoded
in the polynomial $\Lambda[x] = \sum_\ell \Lambda_\ell x^\ell$.

• A test has degree k with probability $P_k$. This is encoded in $P[x] = \sum_k P_k x^k$.

It is also useful to introduce the 'edge perspective degree profiles'. Let $\lambda_\ell$
be the probability that, when one picks an edge at random in the factor graph,
the variable to which it is attached has degree $\ell$, and $\rho_k$ be the probability
that the test to which it is attached has degree k. Then $\lambda_\ell = \ell\Lambda_\ell/(\sum_n n\Lambda_n)$
and $\rho_k = kP_k/(\sum_n nP_n)$. These distributions are encoded in the functions
$\lambda[x] = \sum_\ell \lambda_\ell x^{\ell-1}$ and $\rho[x] = \sum_k \rho_k x^{k-1}$.

Assuming that we have generated a random graph with girth $\ge 6$, the various
quantities that appear in the computation of the expected number of tests can
be expressed in terms of the generating functions.

• The total number of pools in the first stage is: $G = N\langle\ell\rangle/\langle k\rangle = N\Lambda'[1]/P'[1]$.

• The number of sure OK items (girth $\ge 6$) detected in the first stage is:
$N_0 = N(1-p)\left(1-\Lambda[1-\rho[1-p]]\right)$.

• The number of sure defective items (girth $\ge 6$) detected in the first stage is:
$N_1 = Np\left(1 - \Lambda\left[1 - \rho\left[(1-p)\left(1-\lambda[1-\rho[1-p]]\right)\right]\right]\right)$.

• All the items which are not detected after the first stage must be tested
individually in the second stage. Therefore the total expected number of
tests is:

$$ E\,T = G + N - (N_0 + N_1) . \tag{5} $$

This expression is to be minimized over the degree distributions $\lambda[x]$ and
$\rho[x]$.

Figure 2: Expected number of tests in a sure group testing with a two-stage
procedure: optimal performance obtained in the optimal regular-regular random
pool design. The graph gives the expected number of tests, divided by
the information theoretic lower bound for an arbitrary number of stages ($N H_2(p)$),
plotted versus log p. The non-analyticity points correspond to the values of p
where the optimal value of the degree pair L, K changes. In the small p limit
the curve goes asymptotically to $1/\log 2$.

Some simple probability distributions can be studied efficiently. For instance,
the regular-regular one, parameterized by $\Lambda[x] = x^L$ and $P[x] = x^K$, leads to a
simple result for $E\,T$ in (5). This can be optimized with respect to K and L for
any p. Figure 2 shows the result.

When $p \to 0$, it is easy to prove that the optimal values of K and L are
$K^*(p) = \log 2/p$ and $L^*(p) = |\log p|/\log 2$, and that they saturate the previous lower
bound, thus providing the exact asymptotic value of the minimal (over all two-stage
procedures) number of expected tests: $T = N(-p\log_2 p)/\log 2$. This is
also the case for regular-Poisson graphs.
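A heuristic way to see why the choice $K^* = \log 2/p$, $L^* = |\log p|/\log 2$ matches the lower bound is the following leading-order estimate, which assumes the girth condition under which (5) holds and is a sketch rather than the proof of [11]. With this choice each first-stage test is positive with probability close to 1/2:

$$ (1-p)^{K^*-1} \simeq e^{-pK^*} = \frac{1}{2} , \qquad G = \frac{N L^*}{K^*} = \frac{N p\,|\log p|}{(\log 2)^2} , $$

while a given OK item remains undetected with probability

$$ \left[1-(1-p)^{K^*-1}\right]^{L^*} \simeq 2^{-L^*} = e^{-|\log p|} = p . $$

The second stage therefore involves of order $Np$ undetected 0s in addition to the $Np$ defectives, which is negligible compared to G when $p \to 0$, so that $E\,T \simeq Np|\log p|/(\log 2)^2$.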

For finite p, the best degree sequences $\Lambda$, P are not known. However we have
proved that at most 3 coefficients $\Lambda_\ell$, and at most 5 coefficients $P_k$, are non-zero
in these optimal sequences. Plugging this information into a numerical
minimization of (5), we have observed that for most
values of p the optimal degree sequence seems to be the regular-regular one.
There are also some values where the optimal graph is slightly more complicated.
For instance for p = .03, the best sequences we found are $\Lambda[x] = x^4$ and $P[x] =
.45164\,x^{21} + .54836\,x^{22}$, giving $E\,T/N = .25450$, slightly better than the one
obtained with $\Lambda[x] = x^4$ and $P[x] = x^{22}$, giving $E\,T/N = .25454$. But for all values
of p we have explored, we have always found that either the regular-regular graph
is optimal, or the optimal graph has a superposition of two neighbouring degrees
of the variables, as in this p = .03 case. In any case regular-regular is always
very close to the optimal structure.
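The expected number of tests per item in (5) is easy to evaluate numerically for any pair of degree profiles. The sketch below is one possible implementation: the dictionary encoding `{degree: probability}` and all function names are assumptions made for illustration, and the formulas are those listed above for graphs satisfying the girth condition.

```python
def poly(coeffs: dict, x: float, shift: int = 0) -> float:
    """Evaluate sum_d coeffs[d] * x**(d - shift)."""
    return sum(c * x ** (d - shift) for d, c in coeffs.items())


def mean_degree(profile: dict) -> float:
    return sum(d * c for d, c in profile.items())


def edge_profile(profile: dict) -> dict:
    """Edge-perspective profile: lambda_l = l Lambda_l / sum_n n Lambda_n."""
    mean = mean_degree(profile)
    return {d: d * c / mean for d, c in profile.items()}


def expected_tests_per_item(p: float, Lambda: dict, P: dict) -> float:
    """E T / N from eq. (5), with G, N_0 and N_1 divided by N."""
    lam, rho = edge_profile(Lambda), edge_profile(P)
    g = mean_degree(Lambda) / mean_degree(P)          # first-stage tests per item
    # q: probability that a test reached through an edge contains another defective
    q = 1.0 - poly(rho, 1.0 - p, shift=1)
    n0 = (1.0 - p) * (1.0 - poly(Lambda, q))          # sure OK items
    # s: probability that a neighbour is OK and detected as such by its other tests
    s = (1.0 - p) * (1.0 - poly(lam, q, shift=1))
    n1 = p * (1.0 - poly(Lambda, 1.0 - poly(rho, s, shift=1)))  # sure defectives
    return g + 1.0 - (n0 + n1)
```

Calling `expected_tests_per_item(0.03, {4: 1.0}, {22: 1.0})` and `expected_tests_per_item(0.03, {4: 1.0}, {21: 0.45164, 22: 0.54836})` gives values close to the two figures quoted above for p = .03, the mixed profile being marginally better.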

3 One stage group testing: inference 

Another interesting aspect of group testing, which we now discuss, is the identification
of defective items, given a pool design and the results of the tests. This
amounts to minimizing the number of errors in a one-stage experiment. Given a
set of pools and the corresponding test results, the identification of the most
probable status of each variable is a typical inference problem that can be formalized
as follows. Imagine that the item statuses are given by $y = (y_1, \ldots, y_N)$,
where $y_i = 1$ if item i is faulty, $y_i = 0$ if it is OK. Then the test a returns a signal
$t_a = T_a(y) = 1$ if $\sum_{i\in V(a)} y_i \ge 1$, otherwise $t_a = 0$, where $V(a)$ denotes the set
of items in pool a. Given these test results, one
can compute the probability that the item statuses are given by $x = (x_1, \ldots, x_N)$.
This is given by:

$$ P(x) = \frac{1}{Z}\prod_{i=1}^{N}\left[(1-p)\,\delta_{x_i,0} + p\,\delta_{x_i,1}\right]\prod_a \mathbb{I}\left[T_a(x) = t_a\right] . \tag{6} $$

Our task is to find the configuration $x^*$ which maximizes this probability. The
detection error will be measured by how much $x^*$ differs from y.

In order to find $x^*$, we first use the fact that, whenever a test a returns a
value $T_a(y) = 0$, we are sure that all the variables i belonging to pool a are
sure 0s: $y_i = 0$. Therefore we know that $x_i^* = 0$. This simple remark leads us
to a 'graph stripping' procedure: we can take away from the graph all tests a
such that $t_a = 0$, and all the variables in these pools: they are sure 0s. The
remaining 'reduced graph' has only tests a with $t_a = 1$. With some slight abuse
of notation, let us call x the set of variables which remain in this reduced graph
(their number is $N' \le N$). The probability distribution on these remaining
variables can be written as

$$ Q(x) = \frac{1}{Z'}\prod_{i=1}^{N'}\left[(1-p)\,\delta_{x_i,0} + p\,\delta_{x_i,1}\right]\prod_a \mathbb{I}\left[T_a(x) = 1\right] . \tag{7} $$

The reduced problem can thus be formulated as follows: Find the values of
$x_1, \ldots, x_{N'}$ such that:

• For each test a in the reduced graph, there is at least one of the variables 
in its pool that is defective. 

• The total number of defective variables should be minimized. 

This problem is a version of the celebrated vertex cover problem [TH [XSJ US] to 
the case of a hyper-graph. It is known as the hitting set problem. In the next 
section we discuss the statistical physics of this problem. 
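The graph stripping step that leads to the reduced problem can be sketched as follows. The representation of pools as lists of item indices and the function names are assumptions made for illustration; tests are taken to be perfect, as in the text.

```python
def run_tests(pools, y):
    """Perfect tests: t_a = 1 if pool a contains at least one defective item."""
    return [int(any(y[i] for i in pool)) for pool in pools]


def strip_graph(pools, results):
    """Remove every negative test together with all the items it contains.

    Returns (sure_ok, reduced_pools): the set of items known to be OK, and the
    positive pools restricted to the items whose status is still undetermined.
    """
    sure_ok = set()
    for pool, t in zip(pools, results):
        if t == 0:
            sure_ok.update(pool)          # every item of a negative pool is a sure 0
    reduced_pools = [[i for i in pool if i not in sure_ok]
                     for pool, t in zip(pools, results) if t == 1]
    return sure_ok, reduced_pools
```

With pools `[[0, 1], [2, 3], [1, 4], [3, 4]]` and a single defective item 1, the stripping leaves the reduced pools `[[0, 1], [1]]`: the second one contains a single item, which must therefore be defective, while the status of item 0 remains undetermined by the tests alone.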

4 Hitting set 

The hitting set problem is an interesting problem in itself. In order to get some
experience with it, we have studied in [18] the hitting set problem in the case
of random regular hyper-graphs where tests have degree K and variables have
degree L. We define the weight of a configuration as $A(\{x_i\}) = \sum_{i=1}^{N} x_i$. The
Boltzmann-Gibbs measure of the problem is defined as

$$ P(x) = \frac{1}{Z}\, e^{-\mu A(\{x_i\})} \prod_a \mathbb{I}\left[T_a(x) = 1\right] . \tag{8} $$

We first write the Belief Propagation (BP) equations for this problem. Given a
graph and a variable i, we consider a sub-graph rooted in i obtained by removing
the edge between i and one of its neighbouring tests, a. Define $Z_0^{(i\to a)}$ and
$Z_1^{(i\to a)}$ as the partition functions of this sub-graph restricted to configurations
where the variable $x_i$ is respectively OK ($x_i = 0$) or defective ($x_i = 1$). If the
underlying graph is a tree, these two numbers can be computed recursively as
follows:

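$$ Z_0^{(i\to a)} = \prod_{b\in\partial i\setminus a} Y_0^{(b\to i)} \tag{9} $$

$$ Z_1^{(i\to a)} = e^{-\mu}\prod_{b\in\partial i\setminus a} Y_1^{(b\to i)} \tag{10} $$

$$ Y_0^{(a\to i)} = \prod_{j\in\partial a\setminus i}\left(Z_0^{(j\to a)} + Z_1^{(j\to a)}\right) - \prod_{j\in\partial a\setminus i} Z_0^{(j\to a)} \tag{11} $$

$$ Y_1^{(a\to i)} = \prod_{j\in\partial a\setminus i}\left(Z_0^{(j\to a)} + Z_1^{(j\to a)}\right) . \tag{12} $$

Here $\partial i$ denotes the set of tests containing variable i, $\partial a$ the set of variables
in test a, and $Y_0^{(a\to i)}$, $Y_1^{(a\to i)}$ are the partition functions of the sub-graph
rooted in test a, with the edge between a and i removed, when test a is required to
be satisfied with $x_i = 0$ or $x_i = 1$ respectively.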

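Belief propagation amounts to using these equations on our problem, even if
the graph is not a tree. The equations can be simplified by introducing two local
cavity fields on each edge of the graph, defined as $e^{-\mu h_{i\to a}} = Z_1^{(i\to a)}/(Z_0^{(i\to a)} + Z_1^{(i\to a)})$
and $e^{\mu v_{a\to i}} = Y_0^{(a\to i)}/Y_1^{(a\to i)}$. They become:

$$ e^{-\mu h_{i\to a}} = \frac{\exp(-\mu)}{\exp(-\mu) + \exp\left(\mu\sum_{b\in\partial i\setminus a} v_{b\to i}\right)} , \qquad e^{\mu v_{a\to i}} = 1 - \prod_{j\in\partial a\setminus i}\left(1 - e^{-\mu h_{j\to a}}\right) . \tag{13} $$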
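As an illustration, the BP iteration can be written compactly in terms of the two quantities $q_{i\to a} = e^{-\mu h_{i\to a}}$ and $w_{a\to i} = e^{\mu v_{a\to i}}$. The sketch below assumes this reading of the cavity fields, follows directly from the recursions for the partition functions, and uses a plain fixed number of sweeps; all names are illustrative. On a tree the resulting marginals are exact, which the brute-force enumeration allows one to check on small instances.

```python
import itertools
import math


def bp_marginals(pools, n_items, mu, n_iter=50):
    """BP estimate of P(x_i = 1) under the measure exp(-mu * sum_i x_i)
    restricted to configurations where every pool contains a 1."""
    tests_of = [[] for _ in range(n_items)]
    for a, pool in enumerate(pools):
        for i in pool:
            tests_of[i].append(a)
    z = math.exp(-mu)
    q = {(i, a): 0.5 for a, pool in enumerate(pools) for i in pool}  # variable -> test
    w = {(a, i): 0.5 for a, pool in enumerate(pools) for i in pool}  # test -> variable
    for _ in range(n_iter):
        # w_{a->i} = 1 - prod_{j in a, j != i} (1 - q_{j->a})
        w = {(a, i): 1.0 - math.prod(1.0 - q[(j, a)] for j in pools[a] if j != i)
             for (a, i) in w}
        # q_{i->a} = e^-mu / (e^-mu + prod_{b in tests of i, b != a} w_{b->i})
        q = {(i, a): z / (z + math.prod(w[(b, i)] for b in tests_of[i] if b != a))
             for (i, a) in q}
    return [z / (z + math.prod(w[(b, i)] for b in tests_of[i]))
            for i in range(n_items)]


def exact_marginals(pools, n_items, mu):
    """Brute-force marginals by enumeration (small instances only)."""
    weights, total = [0.0] * n_items, 0.0
    for x in itertools.product((0, 1), repeat=n_items):
        if all(any(x[i] for i in pool) for pool in pools):
            wgt = math.exp(-mu * sum(x))
            total += wgt
            for i in range(n_items):
                if x[i]:
                    weights[i] += wgt
    return [wi / total for wi in weights]
```

On the tree-like instance `[[0, 1, 2], [2, 3], [3, 4, 5]]` the two functions agree to numerical precision; on graphs with loops `bp_marginals` is only an approximation and convergence is not guaranteed.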

The replica symmetric (RS) solution to this problem amounts to assuming that
there is a unique solution to these equations, and in the case of random regular
graphs it must be a translation invariant solution: $h_{i\to a} = h_{RS}$ and $v_{a\to i} = v_{RS}$
for all edges (i, a), with:

$$ \mu v_{RS} = \ln\left\{1 - \left[\frac{e^{\mu(L-1)v_{RS}}}{e^{-\mu} + e^{\mu(L-1)v_{RS}}}\right]^{K-1}\right\} . \tag{14} $$



Solving these equations, one can obtain the density of defective items as well as
the entropy of the system using the usual formulas for the RS Bethe free energy
(see for instance [18]). Figure 3 shows the results for two values of the degree
pair L, K. This shows that the RS solution fails at high chemical potential and
low density of active items, at least for L = 6 and K = 12, because it yields a
negative entropy.

Figure 3: Density of active items, $\rho$, and entropy density, s = S/N, as a function
of the chemical potential, $\mu$, in the RS solution of the hitting set problem for
L = 2 and K = 6 (left panel) and for L = 6 and K = 12 (right panel).
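The RS density of active items can be obtained numerically with a few lines of code. The sketch below works with the variable $u = e^{\mu v_{RS}}$, for which the translation invariant BP equation reads $u = 1 - [u^{L-1}/(e^{-\mu} + u^{L-1})]^{K-1}$, and solves it by bisection; the density is then taken as the single-site marginal $e^{-\mu}/(e^{-\mu} + u^{L})$. This parametrization and the function names are assumptions made for illustration, and the entropy (which requires the Bethe free energy) is not computed.

```python
import math


def rs_fixed_point(L: int, K: int, mu: float, tol: float = 1e-14) -> float:
    """Solve u = 1 - [u^(L-1) / (e^-mu + u^(L-1))]^(K-1) on (0, 1) by bisection.

    The difference g(u) = u - rhs(u) is increasing, with g(0) = -1 and g(1) > 0,
    so the root is unique (L >= 2 is assumed).
    """
    z = math.exp(-mu)

    def g(u: float) -> float:
        t = u ** (L - 1)
        return u - (1.0 - (t / (z + t)) ** (K - 1))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rs_density(L: int, K: int, mu: float) -> float:
    """RS density of active items: marginal probability that x_i = 1."""
    u = rs_fixed_point(L, K, mu)
    z = math.exp(-mu)
    return z / (z + u ** L)
```

For L = 2 and K = 6 the density starts slightly above 1/2 at $\mu = 0$ and decreases when the chemical potential is increased, as expected since larger $\mu$ penalizes active items.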



A one step replica symmetry breaking (1RSB) solution can be constructed
along the lines of reference [19] (see also [20] for an exactly solvable case). The
equations take a particularly simple form in the large $\mu$ limit. In this limit, the survey
propagation (SP) equations [21, 22] can be written in terms of one message $u_0^{a\to i}$
per edge of the graph. The SP equations for the hitting set problem are [18]:

$$ u_0^{a\to i} = 1 - \prod_{j\in\partial a\setminus i} \frac{\prod_{b\in\partial j\setminus a} u_0^{b\to j}}{e^{-y} + (1-e^{-y})\prod_{b\in\partial j\setminus a} u_0^{b\to j}} . \tag{15} $$

For random regular graphs, the 1RSB solution can be obtained by assuming
that $u_0^{a\to i}$ is translation invariant, $u_0^{a\to i} = u_0$. Equation (15) can then be solved
easily, and from this solution one can compute, using the technique of [19], the
complexity function. In the present case, this function gives 1/N' times the
logarithm of the number of clusters of solutions, versus the optimal density of
the cluster $\rho$. Figure 4 gives the result for the case L = 4 and K = 8. The
value of the density where the complexity goes to zero gives the minimal density
such that a hitting set exists. The 1RSB solution is stable to further replica
symmetry breaking effects for a wide range of values of the degree pair L, K.
Figure 5 summarizes the nature of the low density phase when one varies L and
K.



Figure 4: Complexity $\Sigma$ as a function of the density of active variables $\rho$ for
L = 4 and K = 8. $\rho_{uncov}$ (where $\Sigma = 0$) is the minimal covering density. Below
$\rho_{uncov}$ it is not possible to find solutions.

Figure 5: Phase diagram of the hitting set problem. Squares, circles and triangles
correspond, respectively, to the values of L and K for which the minimal
hitting set configurations are obtained in a RS phase, a 1RSB phase, and a
full RSB phase. For the cases RS and 1RSB, we have obtained a closed form
expression for the value of the minimal density $\rho$.

4.1 Survey propagation and survey inspired decimation

As proposed first in [21], the equations obtained from the cavity method can be
applied to a single instance of the inference problem, and turned into efficient
algorithms. We have studied in [18] two decimation algorithms, one based on
the BP equations (13) and one based on the SP equations (15). In both cases
the strategy is the same: one iterates the equations in parallel, starting from a
random initial condition. If a fixed point is reached, one computes the degree
of polarization $p_i$ of each variable; $p_i$ measures to what extent the marginal
probability distribution of variable i is biased, either towards $x_i = 0$ or towards
$x_i = 1$. In the BP case, $p_i$ is defined as



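$$ p_i = \frac{\exp(-\mu)}{\exp(-\mu) + \exp\left(\mu\sum_{b\in\partial i} v_{b\to i}\right)} . \tag{16} $$

In the case of SP, it is defined as

$$ p_i = \frac{e^{-y}\left(1 - \prod_{b\in\partial i} u_0^{b\to i}\right)}{e^{-y} + (1-e^{-y})\prod_{b\in\partial i} u_0^{b\to i}} . \tag{17} $$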



The idea of the BP (or SP) inspired decimation algorithm is to identify the
most polarized variable from BP (or SP), and fix its value $x_i$ to its most probable
value. Then variable $x_i$ is removed from the graph; if $x_i = 1$ the tests connected
to i are also removed. This procedure is then iterated until all variables are
fixed. A subtle issue concerns the value of $\mu$ chosen in BP (resp. the value of
y chosen in SP). In order to get a better convergence, we compute the entropy
versus $\mu$ in the BP case (resp. the complexity versus y in the SP case), and fix
the value of $\mu$ (resp. y) to the largest value such that the entropy (resp. the
complexity) is positive. In this way $\mu$ (resp. y) evolves during the decimation
procedure.

In order to get some point of comparison, we have compared the BP and
SP inspired decimation to a greedy algorithm, simply defined by the iteration
of the procedure: find the variable i of largest degree, fix it to $x_i = 1$, clean the
graph. On one instance of a random regular graph with L = 4, K = 6, N =
12288, these three algorithms have obtained hitting sets with the following
minimal densities:

• Greedy algorithm: $\rho \simeq 0.212$.

• BP inspired decimation: $\rho \simeq 0.186$.

• SP inspired decimation: $\rho \simeq 0.182$.

Notice that the prediction from the previous section states that, for L = 4, K =
6, the minimal density necessary to obtain a hitting set, for an infinite graph,
should be $\rho = 0.178$. On this example, and in various other experiments that
we have tried, SP inspired decimation performs slightly better than the other
algorithms. It would be interesting to extend such comparisons more systematically.
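The greedy baseline described above can be sketched as follows. The data layout (pools as lists of item indices), the tie-breaking rule (lowest index among the variables of largest degree) and the function name are assumptions made for illustration; degrees are recomputed from scratch at each step for clarity rather than speed.

```python
def greedy_hitting_set(pools, n_items):
    """Repeatedly pick the variable of largest current degree, set it to 1,
    and remove ('clean') every test that contains it."""
    remaining = [set(pool) for pool in pools if pool]   # empty pools cannot be hit
    chosen = set()
    while remaining:
        degree = [0] * n_items
        for pool in remaining:
            for i in pool:
                degree[i] += 1
        best = max(range(n_items), key=degree.__getitem__)
        chosen.add(best)
        remaining = [pool for pool in remaining if best not in pool]
    return chosen
```

On the small instance `[[0, 1], [1, 2], [1, 3], [3, 4]]` the procedure first selects variable 1, which hits three tests, and then variable 3, giving the hitting set {1, 3}.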



5 Perspectives 

Group Testing offers a variety of interesting questions, some of which also have 
some practical relevance. One of the results that could turn out to be important 
is the fact that the message passing approaches to the group testing inference 
problem seem to be fast and efficient. They are also easily generalizable to 
the case of imperfect tests. One can expect that this will be useful in realistic
applications of group testing. 

From the point of view of statistical physics, the search for hitting sets gives a 
new class of problems which exhibits in many cases the general pattern of 1RSB. 
In these cases, the hyper-vertex cover problem is thus under much better control
than the usual vertex cover (which exhibits full RSB). Actually, technically these 
problems are rather simple to solve even at the 1RSB level. They could thus 
offer an interesting practice field to develop mathematical tools. 

We thank Irina Rish, Greg Sorkin and Lenka Zdeborova for interesting and
stimulating discussions. This work has been supported in part by the EVERGROW
EC consortium in the FP6-IST program.